[consumer] Allow annotating consumer errors with metadata #9041
Conversation
Happy to see this added. As discussed in #9260, there is potential to propagate backwards the information contained in
I worry about the code complexity introduced to have "success error" responses, meaning
As discussed in open-telemetry/oteps#238, it would be useful for setting the correct
This PR was marked stale due to lack of activity. It will be closed in 14 days.
This PR was marked stale due to lack of activity. It will be closed in 14 days.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9041      +/-   ##
==========================================
+ Coverage   92.22%   92.23%   +0.01%
==========================================
  Files         409      411       +2
  Lines       19134    19222      +88
==========================================
+ Hits        17646    17730      +84
- Misses       1127     1131       +4
  Partials      361      361
☔ View full report in Codecov by Sentry.
Right now we're only adding HTTP/gRPC status code propagation; everything else is marked as unimplemented and still in the design stage. Do you have concerns with HTTP/gRPC status codes, or the way they're communicated?
I see now; I mistook what you said for the Uber
This is meant to hold information related to anything the component would like to communicate upstream. Right now we're starting with HTTP and gRPC status codes, but we would like to add retry information, partial success information, etc. in the future. The goal is to start small and iterate as we design the additional options.
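To make the starting point concrete, here is a minimal sketch of how a component might attach a transport status code under the proposed API. The option names follow this PR's proposal, but the exact signatures and the sendToBackend helper are assumptions for illustration:

```go
// Hypothetical exporter-side usage of the proposed consumererror options.
// consumererror.New and WithHTTPStatus follow this PR's proposal;
// sendToBackend is a stand-in for the exporter's real send call.
func pushLogs(ctx context.Context, ld plog.Logs) error {
	if err := sendToBackend(ctx, ld); err != nil {
		// Wrap the error with the status the backend returned so a receiver
		// upstream can propagate a matching code back to its own caller.
		return consumererror.New(err,
			consumererror.WithHTTPStatus(429),
		)
	}
	return nil
}
```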
Can you please respond to the rest of the comments and see what the design of the "ErrorData" would look like?
Sure thing. The other questions were based on the first question, so I wanted to make sure we were on the same page there if possible.
Before answering any of these questions, I need to better understand how chained consumers should "correctly" (recommended, or whatever the right term is) report errors. To know the "right" design, I need to better understand the usage. Here are some examples, and we can debate what the correct way is for each (and eventually document this). For these examples, assume logs are used.
First Example
For a request with 10 logs:
Can you please tell me exactly which component should propagate/append errors, and how?
Second Example
For a request with 10 logs:
Can you please tell me exactly which component should propagate/append errors, and how?
Thank you for the in-depth examples and illustrations. Without going too deep into the implementation details of how this happens, here's how I see each example playing out. I'm assuming both pipelines are synchronous, as any asynchronous processing will cut error propagation at that point.
First example
Second example
How would receiver A do this? It doesn't have the ability when calling
My gut is that the receiver should not retry data if it receives any PartialSuccess or PermanentError data from the error.
The necessary pipeline topology information would be contained within the error container and would be passed in the
I think it's ultimately up to the exporter to determine what combinations of options it supports. It's possible that exporter F uses a protocol where the backend supports idempotent submissions and retryable partial successes are possible. That said, if receiver A is an OTLP receiver and has been configured to pass retry information upstream from the Collector, then it will need to determine whether to retry or return a partial success message since these are mutually exclusive in OTLP. In this case, not retrying seems like the safest course of action, though there could be some nuances here we could discuss in the future.
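As a rough sketch of the decision described here, assuming the proposed accessors (ErrorContainer, IsPartialSuccess, ShouldRetry) from this PR and an OTLP receiver that must pick one of the two mutually exclusive OTLP responses:

```go
// Hypothetical decision logic in an OTLP receiver; the consumererror
// accessors follow this PR's proposal, not an existing API.
func decideResponse(err error) {
	cerr := consumererror.ErrorContainer{}
	if !errors.As(err, &cerr) {
		return
	}
	switch {
	case cerr.IsPartialSuccess():
		// OTLP reports partial successes in a 200 response body, so a partial
		// success takes precedence over asking the client to retry.
	case cerr.ShouldRetry():
		// No partial success recorded: it is safe to ask the client to retry,
		// e.g. via HTTP 429/503 with a Retry-After header.
	default:
		// Permanent failure: map the recorded status code onto the response.
	}
}
```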
If we want to support retries, let's actually configure the "retries" in the components that support them and potentially add more components that can do retries. If you want crazy ideas, let's push them with the request vs supporting dynamic routing.
In general I would agree that retries should be done within the component that needs to do retries, but the goal here is more about configuring scraping receivers that aren't responding to a request. For example, the filelog receiver could throttle reading based on throttling information from downstream endpoints: open-telemetry/opentelemetry-collector-contrib#20864. The reason we need routing like this is to make sure we don't replay requests on exporters that have had successful submissions while retrying the exporters that requested a retry. cc @dmitryax since you have the most context here.
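As a rough sketch of that scraping case: ErrorContainer, ShouldRetry, and Retry follow the proposal in this PR, while the Delay accessor and the scrape loop itself are assumptions for illustration.

```go
// Hypothetical scrape loop for a non-request-driven receiver such as filelog.
// The receiver type, readBatch, and Delay are illustrative assumptions.
func (r *scrapingReceiver) scrapeLoop(ctx context.Context) {
	ticker := time.NewTicker(r.interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			err := r.next.ConsumeLogs(ctx, r.readBatch())
			cerr := consumererror.ErrorContainer{}
			if errors.As(err, &cerr) && cerr.ShouldRetry() {
				// A downstream component asked us to slow down, so pause
				// reading instead of dropping or immediately replaying data.
				time.Sleep(cerr.Retry().Delay())
			}
		}
	}
}
```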
@bogdandrutu As discussed offline, here are some code snippets showing roughly how the above examples would work using something similar to the proposed API in this PR. I've tried to keep them as short as possible while still being nearly-runnable code. Note that a lot of this still hasn't been implemented within this PR.
Example 1:
// A -> B (drops logs with invalid trace identifiers) -> C -> D -> E (exports to Loki)
// Shared library consumption function akin to consumerretry, fanoutconsumer, etc.
func ConsumeLogs(ctx context.Context, ld plog.Logs) error {
	err := processorB(ctx, ld)
	cerr := consumererror.ErrorContainer{}
	if errors.As(err, &cerr) {
		errs := cerr.Errors()
		partialCount := 0
		for _, e := range errs {
			count, ok := e.Partial()
			if ok {
				partialCount += count
			}
		}
		if partialCount > 0 {
			cerr.SetPartialSuccessAggregate(partialCount)
		}
	}
	return err
}
func receiverA(ctx context.Context, ld plog.Logs) error {
	err := ConsumeLogs(ctx, ld)
	cerr := consumererror.ErrorContainer{}
	if errors.As(err, &cerr) {
		if cerr.IsPartialSuccess() {
			// Ignore the 400 error code because OTLP partial successes use 200 status codes.
			// Shorthand for the HTTP response the receiver would send back to its caller.
			return http.Response{code: 200, body: fmt.Sprintf("Partial success: %d out of %d logs failed", cerr.PartialCount(), ld.LogRecordCount())}
		}
	}
	return err
}
func processorB(ctx context.Context, ld plog.Logs) error {
	failedItems := 2
	processorErr := consumererror.New(
		errors.New("Logs had trace IDs that failed validation"),
		consumererror.WithPartial(failedItems),
	)
	err := exporterE(ctx, ld)
	if err != nil {
		return consumererror.Combine(err, processorErr)
	}
	return processorErr
}
func exporterE(ctx context.Context, ld plog.Logs) error {
	err := errors.New("export failed") // stand-in for the error returned by the backend call
	return consumererror.New(err,
		consumererror.WithPartial(2),
		consumererror.WithHTTPStatus(400),
	)
}
Example 2:
//                       / -> E -> F
//         / -> C -> D -
// A -> B -              \ -> G -> H
//         \
//          \ -> I -> J -> K
func receiverA(ctx context.Context, ld plog.Logs) error {
	err := processorB(ctx, ld)
	cerr := consumererror.ErrorContainer{}
	if errors.As(err, &cerr) {
		if cerr.ShouldRetry() {
			// maxRetries would come from the receiver's configuration.
			ctx = context.WithValue(ctx, "otelcol-retry-info", cerr.Retry())
			for i := 0; i < maxRetries; i++ {
				err = processorB(ctx, ld)
				if err == nil {
					return nil
				}
				// Otherwise check the error and continue with the retry.
			}
			// Shorthand for the HTTP response the receiver would send back to its caller.
			return http.Response{code: 400, body: "could not submit data"}
		}
		return http.Response{code: cerr.HttpStatusCode(), body: fmt.Sprintf("Partial success: %d out of %d logs failed", cerr.PartialCount(), ld.LogRecordCount())}
	}
	return err
}
func fanoutConsumerB(ctx context.Context, ld plog.Logs) error {
	exporters := []consumer.ConsumeLogsFunc{processorD, exporterK}
	var allErrs error
	for _, e := range exporters {
		err := e(ctx, ld)
		if err != nil {
			allErrs = consumererror.Combine(allErrs, err)
		}
	}
	// TODO: Aggregate partial success counts if present
	return allErrs
}
func processorB(ctx context.Context, ld plog.Logs) error {
	return fanoutConsumerB(ctx, ld)
}
func fanoutConsumerD(ctx context.Context, ld plog.Logs) error {
	exporters := []consumer.ConsumeLogsFunc{exporterF, exporterH}
	var allErrs error
	for _, e := range exporters {
		err := e(ctx, ld)
		if err != nil {
			allErrs = consumererror.Combine(allErrs, err)
		}
	}
	// TODO: Aggregate partial success counts if present
	return allErrs
}
func processorD(ctx context.Context, ld plog.Logs) error {
	failedItems := 2
	processorErr := consumererror.New(
		errors.New("Logs had trace IDs that failed validation"),
		consumererror.WithPartial(failedItems),
	)
	err := fanoutConsumerD(ctx, ld)
	if err != nil {
		return consumererror.Combine(err, processorErr)
	}
	return processorErr
}
func exporterF(ctx context.Context, ld plog.Logs) error {
	err := errors.New("export failed") // stand-in for the error returned by the backend call
	return consumererror.New(err,
		consumererror.WithPartial(4),
		consumererror.WithHTTPStatus(429),
		consumererror.WithRetry(
			consumererror.WithRetryDelay(10 * time.Second),
		),
	)
}
func exporterH(ctx context.Context, ld plog.Logs) error {
	err := errors.New("export failed") // stand-in for the error returned by the backend call
	return consumererror.New(err,
		consumererror.WithPartial(2),
		consumererror.WithHTTPStatus(400),
	)
}
func exporterK(ctx context.Context, ld plog.Logs) error {
	err := errors.New("export failed") // stand-in for the error returned by the backend call
	return consumererror.New(err,
		consumererror.WithPartial(1),
		consumererror.WithGRPCStatus(codes.InvalidArgument),
	)
}
We had a meeting on 2024-08-27 with the following outcomes:
consumer/consumererror/README.md
Outdated
consumererror.New(err,
	consumererror.WithRetry(
		consumererror.WithRetryDelay(10 * time.Second),
	),
	consumererror.WithGRPCStatus(codes.InvalidArgument),
)
This example must be invalid. codes.InvalidArgument should not be retryable.
In general, it still seems like all these options provide too much flexibility, making them confusing to use and easy to misuse.
For example, why do gRPC and HTTP have to be options? Why can't we just have different constructors like consumererror.NewFromGRPC(grpc.Status)? The status should have the retry information that we can retrieve if applicable. If we expect exporters to add extra info on top of the gRPC error, they can manually parse the gRPC status and create a collector "internal" error another way.
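For comparison, a rough sketch of what that constructor-based shape could look like. NewFromGRPC is the name suggested above; the ErrorContainer and GRPCStatus accessors on the receiving side are assumptions, not an existing API.

```go
// Hypothetical constructor-based variant as suggested in this comment.
func pushLogs(ctx context.Context, ld plog.Logs) error {
	// The gRPC status itself carries retryability: codes.Unavailable is
	// retryable under OTLP, codes.InvalidArgument is not.
	st := status.New(codes.Unavailable, "backend unavailable")
	return consumererror.NewFromGRPC(st)
}

func handleUpstream(err error) {
	cerr := consumererror.ErrorContainer{}
	if errors.As(err, &cerr) {
		// The receiver reads the status back and derives retry behavior from it.
		_ = cerr.GRPCStatus()
	}
}
```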
This is an example of an intentionally unusual option combination since we describe above how combinations that are invalid in OTLP are handled. I'll move this and change it to a more normal combination so we're not highlighting anything we wouldn't recommend. It is worth noting that this is only non-retryable per the OTLP spec and other protocols could consider this a retryable code.
I still think options are the best path forward even if they allow states that are not valid in OTLP, but I'm open to exploring different approaches. A few questions on how we would transition from HTTP/gRPC options to constructors:
Why can't we just have different constructors like consumererror.NewFromGRPC(grpc.Status)? The status should have the retry information that we can retrieve if applicable.
Wouldn't this still be susceptible to the same issue, where an exporter creates a gRPC status with both codes.InvalidArgument and retryable information? Also, where would this information be given when using an HTTP constructor like consumer.NewFromHTTP?
If we expect exporters to add extra info on top of grpc error, they can manually parse the grpc status and create a collector "internal" error another way.
Could you give an example of how we would do this?
Discussion on this PR should be moved to #11085
I'm closing this in favor of #11085. Thanks everyone for your thoughts and reviews.
Description:
Revival of #7439
This explores one possible way to allow adding metadata to errors returned from consumers. The goal is to allow transmitting more data back up the pipeline when there is an error at some stage, so that it can be used by an upstream component, e.g. a component that will retry data, or a receiver that will propagate an error code back to the sender.
The current design eliminates the permanent/retryable error types in favor of a single error type that supports adding signal data to be retried. If no data is added to be retried, the error is considered permanent. For simplicity, no distinction is currently made between the signals; the caller should know which signal is in use when retrieving the retryable items from the error. Any options for retrying the data (e.g. a delay) are offered as options when adding data to retry.
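As a rough illustration of that distinction, here is a sketch under the description above. The WithRetryableLogs option name and its nesting of WithRetryDelay are assumptions for illustration, not the final API.

```go
// Hypothetical sketches of the single error type described above.

// Permanent failure: no data is attached for retry, so upstream components
// treat the error as final.
func permanentFailure(err error) error {
	return consumererror.New(err)
}

// Retryable failure: the failed log records are attached along with retry
// options such as a delay, so an upstream component can replay exactly
// those items.
func retryableFailure(err error, failed plog.Logs) error {
	return consumererror.New(err,
		consumererror.WithRetryableLogs(failed,
			consumererror.WithRetryDelay(10*time.Second),
		),
	)
}
```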
The error type currently supports a few general metadata fields that are copied when a downstream error is wrapped:
Link to tracking Issue:
Resolves #7047
cc @dmitryax